Red Wine Quality Exploration by Amnah Samkari

Here we explore the Red Wine Quality data set to check which chemical properties influence the quality of red wines. The data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

These properties are:

  1. Fixed Acidity: Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).

  2. Volatile Acidity: The amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.

  3. Citric Acid: Found in small quantities, citric acid can add ‘freshness’ and flavor to wines.

  4. Residual Sugar: The amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.

  5. Chlorides: The amount of salt in the wine.

  6. Free Sulfur Dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

  7. Total Sulfur Dioxide: Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

  8. Density: The density of wine is close to that of water depending on the percent alcohol and sugar content.

  9. pH: Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

  10. Sulphates: A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

  11. Alcohol: The percent alcohol content of the wine.


Univariate Plots Section

In this section, we will get some sense of our data, starting with wine quality and then exploring the distribution of each property.

Summary and General info

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Quality

We will start investigating our data with the distribution of quality. The quality rating in our data set ranges from 3 to 8, where 3 is the closest to 0 (bad quality) and 8 is the closest to 10 (good quality).

To make these quality numbers more readable, we will create a rating for each wine and group all wines into three groups as follows:

  • 0 - 4: Bad
  • 5 - 6: Average
  • 7 - 10: Good

Showing the counts as percentages in the second figure tells us that about 82% of red wines are rated average, while less than 4% are considered bad and a bit less than 15% are good.
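The grouping and the percentage shares described here can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code behind this report: the function names `rating_for` and `rating_shares` are hypothetical, and the ten quality scores are only the first ten values visible in the structure output above, so the resulting shares do not match the full data set.

```python
from collections import Counter

def rating_for(quality):
    """Map a 0-10 quality score to the three-level rating used in this report."""
    if quality <= 4:
        return "Bad"
    if quality <= 6:
        return "Average"
    return "Good"

def rating_shares(qualities):
    """Return the percentage of wines that fall in each rating group."""
    counts = Counter(rating_for(q) for q in qualities)
    total = len(qualities)
    return {label: 100.0 * counts[label] / total
            for label in ("Bad", "Average", "Good")}

# First ten quality scores from the structure output (illustrative sample only).
sample_quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
shares = rating_shares(sample_quality)
print(shares)
```

Keeping the labels in a fixed order (Bad, Average, Good) mirrors the idea of an ordered rating, so plots and tables list the groups from worst to best.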

Fixed Acidity

Fixed acidity without outliers seems to be right skewed: most wines tend to have a fixed acidity close to 7, and few fall below it.

Volatile Acidity

Volatile acidity does not follow any recognizable distribution, and the figure gives no clear indication of a pattern.

Citric Acid

The figure above shows several tall spikes. Taking out the outliers would remove them, which would be inappropriate here, since the result would no longer reflect the data.

Residual Sugar

Here, due to the long tail, we used a log10 transformation to observe the distribution better, and we can see that the distribution of residual sugar is still right skewed.
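To see why a log10 scale helps with a long right tail, the sketch below compares how far the maximum sits above the median with how far the minimum sits below it, before and after the transformation. It uses only the minimum, median, and maximum of residual sugar from the summary output above; the helper name `tail_ratio` is hypothetical, and this ratio is a rough illustration rather than a formal skewness measure.

```python
import math

# Residual sugar values taken from the summary output (grams/liter).
sugar_min, sugar_median, sugar_max = 0.9, 2.2, 15.5

def tail_ratio(low, mid, high, transform=lambda v: v):
    """Ratio of the upper spread (high - mid) to the lower spread (mid - low)."""
    lo, md, hi = transform(low), transform(mid), transform(high)
    return (hi - md) / (md - lo)

raw_ratio = tail_ratio(sugar_min, sugar_median, sugar_max)
log_ratio = tail_ratio(sugar_min, sugar_median, sugar_max, math.log10)

# A ratio near 1 means the two sides are balanced; larger means a longer right tail.
print(round(raw_ratio, 1), round(log_ratio, 1))
```

The ratio shrinks a great deal on the log10 scale but stays above 1, which is consistent with a distribution that remains right skewed after the transformation.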

Chlorides

Here, again due to the long tail, we used a log10 transformation to observe the distribution better, and we can see that the distribution of chlorides forms a bell shape on the log10 scale, which suggests it is approximately normal after the transformation.

Free Sulfur Dioxide

Here we see that the free sulfur dioxide is normally distributed.

Total Sulfur Dioxide

The total sulfur dioxide is right skewed, with 2 very extreme outliers.

Density

pH

Here we can see that the pH level is approximately normally distributed.

Sulphates

Sulphates appear roughly normally distributed after the log10 transformation.

Alcohol

Finally, alcohol is right skewed but without a long tail; it has some outliers and one extreme outlier.
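The report does not state how outliers were identified. One common convention, assumed here purely for illustration, is the boxplot rule that flags values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to quartiles copied from the summary output above; the function name `iqr_fences` is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences located k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# First quartile, third quartile, and maximum from the summary output.
summary = {
    "alcohol": {"q1": 9.5, "q3": 11.1, "max": 14.9},
    "total.sulfur.dioxide": {"q1": 22.0, "q3": 62.0, "max": 289.0},
}

for name, s in summary.items():
    lower, upper = iqr_fences(s["q1"], s["q3"])
    # Under this assumed rule, the maximum counts as an outlier if it exceeds the upper fence.
    print(name, "upper fence:", round(upper, 2), "max is beyond it:", s["max"] > upper)
```

Under this assumption, the maxima of both alcohol and total sulfur dioxide lie beyond the upper fence, which agrees with the outliers noted in the text.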


Univariate Analysis

What is the structure of your dataset?

The data set contains 1,599 samples of red wine with 12 variables: 11 are the chemical properties of each red wine, and the last one is the experts’ rating of its quality. (The data frame shows 13 columns because X is a row index.)

  • Most samples received a quality rating of 5 or 6.
  • Observing fixed and volatile acidity without outliers, we can see that they take a rectangular shape rather than a normal distribution.
  • Most of the properties form right skewed shapes, while only pH and density are normally distributed.

What is/are the main feature(s) of interest in your dataset?

I would like to explore how the acidity-related chemicals affect the taste and quality of the wine, and what effects sugar and alcohol have on it.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think that fixed acidity, volatile acidity, residual sugar, and alcohol play a major role in the quality of the wine.

Did you create any new variables from existing variables in the dataset?

I created the variable rating, which is an ordered factor, to classify the wine based on its quality as follows:

  • 0 - 4: Bad
  • 5 - 6: Average
  • 7 - 10: Good

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the data were right skewed, with only pH and density closer to being normally distributed. The longest tails belong to residual sugar, chlorides, sulfur dioxide, and sulphates.
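A quick way to cross-check the skew described here against the summary output is to compare each mean with its median: in a right skewed variable the long tail pulls the mean above the median. The sketch below is a crude heuristic, not a formal skewness test; the values are copied from the summary output above and the helper name `relative_gap` is hypothetical.

```python
# (mean, median) pairs copied from the summary output.
mean_median = {
    "residual.sugar": (2.539, 2.2),
    "chlorides": (0.08747, 0.079),
    "total.sulfur.dioxide": (46.47, 38.0),
    "sulphates": (0.6581, 0.62),
    "pH": (3.311, 3.31),
    "density": (0.9967, 0.9968),
}

def relative_gap(mean, median):
    """Gap between mean and median, expressed as a fraction of the median."""
    return (mean - median) / median

for name, (mean, median) in mean_median.items():
    # A clearly positive gap hints at a right tail; a gap near zero hints at symmetry.
    print(f"{name:22s} {relative_gap(mean, median):+.3%}")
```

The four long-tailed variables show a clearly positive gap, while pH and density have a mean and median that almost coincide.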


Bivariate Plots Section

This section is divided into two parts: the first compares all elements to quality, while the second takes pairs of elements and investigates their influence on each other.

All Properties VS Quality

Here we will explore the 11 properties of the wine and how they affect its quality.

Fixed Acidity vs. Quality

Fixed acidity does not show any clear relation with the quality.

Volatile Acidity vs. Quality

Volatile acidity decreases as quality increases.

Citric Acid vs. Quality

Citric acid does not show any clear relation with the quality.

Residual Sugar vs. Quality

Residual sugar does not show any clear relation with the quality.

Chlorides vs. Quality

Chlorides decrease as quality increases.

Free Sulfur Dioxide vs. Quality

Free sulfur dioxide does not show any clear relation with the quality.

Total Sulfur Dioxide vs. Quality

Total sulfur dioxide does not show any clear relation with the quality, but its range becomes narrower when the quality is high.

Density vs. Quality

Density shows a weak relation with the quality: it gets lower as quality gets higher.

pH vs. Quality

pH level does not show any clear relation with the quality.

Sulphates vs. Quality

Sulphates shows a very clear relation with the quality, as its amount becomes higher with higher quality.

Alcohol vs. Quality

Here alcohol percentage shows a relation with the quality: the percentage is higher for higher quality wines.
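Comparisons like this one amount to summarizing a property within each quality level. The sketch below shows the idea with a per-group mean. It is only an illustration: the helper name `mean_by_group` is hypothetical, and the data are the first ten rows visible in the structure output above, far too few to support any conclusion on their own.

```python
from collections import defaultdict

def mean_by_group(groups, values):
    """Average `values` within each distinct label found in `groups`."""
    buckets = defaultdict(list)
    for g, v in zip(groups, values):
        buckets[g].append(v)
    return {g: sum(vs) / len(vs) for g, vs in buckets.items()}

# First ten rows of quality and alcohol from the structure output.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]

alcohol_by_quality = mean_by_group(quality, alcohol)
print(alcohol_by_quality)
```

A median per group, or a boxplot per quality level, would be less sensitive to outliers than the mean used in this sketch.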

After observing the figures above and limiting the y axis for some of them, we can see some relation between the quality and the following:

Positive Correlation:

  • Sulphates
  • Alcohol

Negative Correlation:

  • Total Sulfur Dioxide
  • Chlorides
  • Volatile Acidity

Now we will try to find other relations between different elements using a correlation matrix.

Correlation Matrix and Other Relations

Here we check whether there are interesting relationships between any two variables that are worth exploring further.

From this correlation matrix, the following pairs of variables seemed interesting:

  • Citric Acid vs. Fixed Acidity
  • Citric Acid vs. Volatile Acidity
  • Total Sulfur Dioxide vs. Free Sulfur Dioxide
  • Density vs. Fixed Acidity
  • Density vs. Chlorides
  • Density vs. Residual Sugar
  • Density vs. Alcohol
  • pH vs. Fixed Acidity
  • pH vs. Volatile Acidity

We will explore them in three groups.
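Each cell of a correlation matrix is a correlation coefficient between two variables. The report does not say which coefficient was used, so the sketch below assumes the Pearson coefficient and computes it by hand for one pair from the list above, pH vs. fixed acidity. The data are only the first ten rows visible in the structure output, so the value it prints is not the value for the full data set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# First ten rows of fixed.acidity and pH from the structure output.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
ph = [3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35]

r = pearson(fixed_acidity, ph)
# A negative r means pH tends to fall as fixed acidity rises in this sample.
print(round(r, 2))
```

Repeating this calculation for every pair of columns fills the whole matrix; values near 1 or -1 mark the pairs worth plotting.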

Citric Acid

Since citric acid has a fairly large correlation with both the fixed and the volatile acidity, let’s explore these two relations.

We can observe from the figures above that the higher the fixed acidity, the higher the citric acid. This might mean that the amount of citric acid needed for the wine to taste fresh increases as the fixed acidity increases, or it might suggest that citric acid is a form of fixed acidity. On the other hand, the relation of citric acid with the volatile acidity seems somewhat monotonically decreasing, but it is not strong enough to confirm this.

Sulfur Dioxide

Here we will check how the total sulfur dioxide and free sulfur dioxide are related.

This figure shows a strong positive correlation when the total sulfur dioxide is 50 or less; beyond that, the correlation starts getting weaker.

Density

Knowing that the density of the wine is primarily determined by the concentration of alcohol, sugar, and other dissolved solids, we will explore how different elements affect the density.

We can observe that both fixed acidity and chlorides have a higher impact on the density than residual sugar: as residual sugar increases, the density remains almost the same, meaning that other components are contributing to the taste.

pH

Here I would like to investigate how fixed acidity and volatile acidity affect the pH level of the wine.

As expected, the higher the fixed acidity, the more acidic the wine tends to be (the lower its pH). In contrast, volatile acidity did not have any major impact on pH, which I did not expect.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We might conclude that the lower the density and the volatile acidity of the wine, the better its quality. In addition, the alcohol level increases with the quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The correlation of citric acid with the fixed acidity is strong and positive, which may lead us to think that citric acid is a form of fixed acid. The same strong positive correlation appeared between total and free sulfur dioxide.

What was the strongest relationship you found?

The strongest relationships were between alcohol and wine quality, between citric acid and fixed acidity, and between total and free sulfur dioxide.


Multivariate Plots Section

Motivated by the results of the last section, and by other questions, we will explore the same relations again, this time also observing how they affect quality.

Acids and Quality

Starting with the acids again, we will check how they affect quality and how they relate to each other.

We can observe from the first plot that fixed acidity and citric acid do not have a strong relation with the quality, while the second plot shows that citric acid combined with low volatile acidity produces better wine quality. The third plot shows the same result, again suggesting that citric acid is a form of fixed acidity, given their strong correlation.

Chlorides, Residual Sugar and Density

Now let’s see how these two different tastes affect the density of the wine, and then how salt and sugar (saltiness and sweetness) affect each other.

The first plot suggests that chlorides have a negative correlation with the wine quality. In the second plot, residual sugar does not appear to have any strong correlation with the density. The last figure does not show any interesting relation between chlorides and residual sugar.

Alcohol, Density vs pH Level

Here we will see the effect of both alcohol and density on the pH, and how all of this affects wine quality.

Again, alcohol does not change the pH here, but it shows the same relation with quality: the higher the alcohol, the better the quality. As for density vs. pH, they do not show any clear or interesting pattern.


Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The relation of volatile acidity and citric acid with the quality was very obvious: the less volatile acidity we have, the better the quality. Examining the pH, good quality wines tend to have a higher pH at higher alcohol levels.

Were there any interesting or surprising interactions between features?

It was interesting how strong the effect of fixed acidity on pH is: as fixed acidity decreases, the pH moves toward a more balanced level very quickly.


Final Plots and Summary

Plot One: Density Distribution

Description One: Density Distribution

After taking a smaller bandwidth, the pH figure shows a normal distribution for data ranging between 3.0 and 3.6. Keeping in mind that the pH scale runs from 0 (acid) to 14 (base), with 7 as the neutral midpoint (water), we conclude that any type of wine, whatever elements contribute to its taste, should stay within this pH range.

Plot Two

Description Two

After exploring the different elements that might affect the density, the fixed acidity figure was the most interesting and the clearest. The points form a strong positive correlation with the density. This suggests that the denser the wine is, the higher the fixed acidity we should expect in it.

Plot Three

Description Three

This figure shows that wine quality gets better with higher alcohol content. The increase in alcohol did not affect the pH level much, meaning that other elements played a role in keeping the wine at its average pH level.


Reflection

It might sound weird, but since I have never tried any wine in my life, this was like exploring and trying to imagine. Still, it was interesting to learn how all these elements affect the quality of the wine. You learn how to form questions, and how to investigate data to get the answers you seek. As for the challenges, I struggled with how to explore this data, because I did not know whether I was asking the correct questions. I overcame this problem by reading more about wine and the effect of each of its components. I think my analysis could be improved if the data explanation contained some figures and visualizations that demonstrate basic information about the wine elements.